{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_iYmMOWbwTPc"
   },
   "source": [
    "# Getting Started with Term Set Expansion\n",
    "\n",
    "Term set expansion is the task of expanding a given partial set of terms into a more complete set of terms that belong to the same semantic class. This tutorial demonstrates the capability of a corpus-based set expansion algorithm.\n",
    "\n",
    "Our approach represents each term in a training corpus with word embeddings in order to estimate the similarity between the seed terms and any candidate term. Noun phrases provide a good approximation of candidate terms and are extracted in our system using a noun phrase chunker. At expansion time, given a set of seed terms, the most similar terms are returned.\n",
    "\n",
    "This tutorial shows how to run term set expansion given an NP2vec model (a Jupyter notebook for NP2vec model training is available at nlp-architect/tutorials/NP2vec/NP2vec_training.ipynb). At the end of the tutorial, we also show how to run a simple web application for term set expansion.\n",
    "\n",
    "First, let’s install the NLP Architect and wget libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Fn5mh3dnwGwq"
   },
   "outputs": [],
   "source": [
    "!pip install nlp-architect\n",
    "!pip install wget"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "scmsmOyD0-yM"
   },
   "source": [
    "Let's import the required Python classes."
   ]
  },
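  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pXaqnn5Gw0V0"
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "import wget\n",
    "\n",
    "from nlp_architect.solutions.set_expansion.set_expand import SetExpand\n",
    "\n",
    "# By default only WARNING and above are shown, so enable INFO\n",
    "# to see the expansion results logged in the cells below.\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "logger = logging.getLogger(__name__)\n",
    "logger.setLevel(logging.INFO)"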
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's download the pretrained NP2vec model `enwiki-20171201_pretrained_set_expansion.txt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tarfile\n",
    "\n",
    "url = 'https://d2zs9tzlek599f.cloudfront.net/models/term_set/enwiki-20171201_pretrained_set_expansion.txt.tar.gz'\n",
    "archive_name = 'enwiki-20171201_pretrained_set_expansion.txt.tar.gz'\n",
    "\n",
    "# Download the archive, then extract the model file into the current directory.\n",
    "wget.download(url, archive_name)\n",
    "with tarfile.open(archive_name) as tarf:\n",
    "    tarf.extractall()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's initialize a `SetExpand` object with the downloaded model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "se = SetExpand(np2vec_model_file='enwiki-20171201_pretrained_set_expansion.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We provide a few examples of term set expansion with different seed sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consider 'signal processing, computer vision' as the seed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seed_str = 'signal processing, computer vision'\n",
    "# Split on commas and strip the whitespace around each seed term.\n",
    "seed_list = [term.strip() for term in seed_str.split(',')]\n",
    "exp = se.expand(seed_list)\n",
    "logger.info('Expanded results:')\n",
    "logger.info(exp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consider another seed, 'Bayesian networks, support vector machines'. The expanded set contains other models and algorithms related to machine learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seed_str = 'Bayesian networks, support vector machines'\n",
    "# Split on commas and strip the whitespace around each seed term.\n",
    "seed_list = [term.strip() for term in seed_str.split(',')]\n",
    "exp = se.expand(seed_list)\n",
    "logger.info('Expanded results:')\n",
    "logger.info(exp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consider another seed, 'San Francisco, Los Angeles'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seed_str = 'San Francisco, Los Angeles'\n",
    "# Split on commas and strip the whitespace around each seed term.\n",
    "seed_list = [term.strip() for term in seed_str.split(',')]\n",
    "exp = se.expand(seed_list)\n",
    "logger.info('Expanded results:')\n",
    "logger.info(exp)"
   ]
  },
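  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the similarity ranking described in the introduction more concrete, the next cell gives a minimal sketch of the idea. It is not NLP Architect's implementation: `toy_vectors` is a made-up embedding table and `rank_candidates` is a hypothetical helper, both introduced here for illustration only. The sketch assumes that candidates are scored by cosine similarity to the mean of the seed vectors, which is one common way to rank terms with word embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Made-up 3-dimensional vectors for illustration only; a real NP2vec model\n",
    "# learns much larger vectors from a corpus.\n",
    "toy_vectors = {\n",
    "    'san francisco': np.array([0.9, 0.1, 0.0]),\n",
    "    'los angeles': np.array([0.8, 0.2, 0.1]),\n",
    "    'seattle': np.array([0.85, 0.15, 0.05]),\n",
    "    'support vector machines': np.array([0.0, 0.9, 0.4]),\n",
    "}\n",
    "\n",
    "\n",
    "def rank_candidates(seed_terms, vectors):\n",
    "    # Hypothetical helper: rank non-seed terms by cosine similarity\n",
    "    # to the mean (centroid) of the seed term vectors.\n",
    "    centroid = np.mean([vectors[term] for term in seed_terms], axis=0)\n",
    "    scores = {}\n",
    "    for term, vec in vectors.items():\n",
    "        if term in seed_terms:\n",
    "            continue\n",
    "        cosine = np.dot(centroid, vec) / (np.linalg.norm(centroid) * np.linalg.norm(vec))\n",
    "        scores[term] = float(cosine)\n",
    "    # Highest similarity first.\n",
    "    return sorted(scores.items(), key=lambda item: item[1], reverse=True)\n",
    "\n",
    "\n",
    "print(rank_candidates(['san francisco', 'los angeles'], toy_vectors))"
   ]
  },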
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also run the term set expansion as a simple web application.\n",
    "\n",
    "First, let's load the expand server with the trained model:\n",
    "\n",
    "`python -m nlp_architect.solutions.set_expansion.expand_server enwiki-20171201_pretrained_set_expansion.txt`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start the UI application:\n",
    "\n",
    "`python -m nlp_architect.solutions.start_ui --solution set_expansion`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that, the term set expansion UI is available at http://localhost:1234."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "Term_Set_Expansion.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
